In this report we analyse origin-destination flux matrices based on the next (most frequent) definition. We fit each model to the total BBC mobility data set (\(\Omega^{T}\)) and three stratifications by employment status (\(\Omega^{U}\),\(\Omega^{Ed}\), \(\Omega^{Em}\), \(\Omega^{N}\)), age of user (\(\Omega^{U}\),\(\Omega^{18-30}\), \(\Omega^{30-60}\), \(\Omega^{60-100}\)) and member nation of the UK (\(\Omega^{E}\),\(\Omega^{W}\), \(\Omega^{S}\), \(\Omega^{NI}\)).
We compare the estimated mobility models to estimates from the 2011 census commuting flow data for England (\(\Omega^{CE}\)), Wales (\(\Omega^{CW}\)), Scotland (\(\Omega^{CS}\)) and Northern Ireland (\(\Omega^{CNI}\)).
We estimate posterior distributions for each model using hamiltonian MCMC (as implemented by the Stan package http://mc-stan.org/). To assess model fit and provide a basis for model selection we use approximate leave-one-out cross validation as implemented in the loo package (doi:10.1007/s11222-016-9696-4).
The per capita probability of moving to a different LAD each day varies by category:
## # A tibble: 5 x 5
## employment_cat N movers p_move cat_prop
## <chr> <int> <int> <dbl> <dbl>
## 1 Under 18 2955 724 0.245 0.0683
## 2 Education 3511 1101 0.314 0.0811
## 3 Employed 30500 14710 0.482 0.705
## 4 NEET 6325 1642 0.260 0.146
## 5 Total 43291 18177 0.420 NA
## # A tibble: 4 x 5
## age_cat N movers p_move cat_prop
## <chr> <int> <int> <dbl> <dbl>
## 1 Under 18 2955 724 0.245 0.0683
## 2 18-30 9611 4015 0.418 0.222
## 3 30-60 26009 11948 0.459 0.601
## 4 60-100 4716 1490 0.316 0.109
but also by LAD.
##
## Call:
## lm(formula = df$p_move ~ df$census_p_move)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30521 -0.04598 -0.00478 0.04149 0.38498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10242 0.01129 9.07 <2e-16 ***
## df$census_p_move 0.79342 0.02551 31.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07962 on 389 degrees of freedom
## Multiple R-squared: 0.7133, Adjusted R-squared: 0.7125
## F-statistic: 967.7 on 1 and 389 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 0.08022039 0.1246244
## df$census_p_move 0.74327115 0.8435639
A linear regression of \(p_{BBC}\) against \(p_{C}\) demonstrates a strong linear relationship between the probability of moving as estimated from census data and the BBC total data set (Adjusted R-squared 0.71). The probability of moving to a different LAD per day is \(\sim\) 10% (7-12% 95% CI) greater in the BBC data set.
The coverage of the BBC mobility data set - with a median of 81 users per LAD (range 2-948) - means for the majority of LADS the raw data is too sparse to estimate movement rates for each strata of the BBC model. To address this, we estimate a generalised linear model (with logit link and random effects at the LAD level) to model the per LAD probability of moving and how this is adjusted for each strata (age or employment status). We estimate the random effects models using the lme4 package.
\[ p_{BBC} \sim~ group + (1 | LAD) \]
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(move, N - move) ~ group + (1 | LAD)
## Data: age_df
##
## AIC BIC logLik deviance df.resid
## 7066.5 7093.3 -3528.2 7056.5 1559
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1694 -0.6848 -0.0190 0.6354 3.6237
##
## Random effects:
## Groups Name Variance Std.Dev.
## LAD (Intercept) 0.3552 0.596
## Number of obs: 1564, groups: LAD, 391
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.11036 0.05401 -20.560 < 2e-16 ***
## group18-30 0.85922 0.04981 17.249 < 2e-16 ***
## group30-60 0.97816 0.04626 21.143 < 2e-16 ***
## group60-100 0.34333 0.05503 6.238 4.42e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) g18-30 g30-60
## group18-30 -0.730
## group30-60 -0.785 0.858
## group60-100 -0.660 0.715 0.770
## # A tibble: 3 x 2
## group or
## <chr> <chr>
## 1 18-30 2.36 (2.14,2.6)
## 2 30-60 2.66 (2.43,2.91)
## 3 60-100 1.41 (1.27,1.57)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(move, N - move) ~ group + (1 | LAD)
## Data: emp_df
##
## AIC BIC logLik deviance df.resid
## 6713.9 6740.7 -3352.0 6703.9 1559
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.3905 -0.6224 0.0000 0.6582 3.4142
##
## Random effects:
## Groups Name Variance Std.Dev.
## LAD (Intercept) 0.3502 0.5918
## Number of obs: 1564, groups: LAD, 391
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.10849 0.05388 -20.572 <2e-16 ***
## groupEducation 0.51151 0.05903 8.665 <2e-16 ***
## groupEmployed 1.07800 0.04604 23.416 <2e-16 ***
## groupNEET 0.05724 0.05333 1.073 0.283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) grpEdc grpEmp
## groupEductn -0.616
## groupEmplyd -0.790 0.728
## groupNEET -0.683 0.622 0.799
## # A tibble: 3 x 2
## group or
## <chr> <chr>
## 1 Education 1.67 (1.49,1.87)
## 2 Employed 2.94 (2.69,3.22)
## 3 NEET 1.06 (0.95,1.18)
Raw flux matrices comparing the census work-flow data set (A) to stratified BBC flux matrices.
The approximate leave-one-out cross validation (LOO) method uses pareto smoothed importance sampling (PSIS-LOO) to efficiently estimate the predictive accuracy of a model (expected log pointwise predictive density, \(\hat{elpd}\)) and as a basis for model comparison and selection. The estimated shape parameter \(\hat{k}\) can be used to judge the reliability of the estimate of \(\hat{elpd}\) for each data point (or in our case for each LAD corresponding to a row of \(\Omega_{ji}\)). The estimate of \(\hat{elpd}\) is considered reliable (quick convergence) for \(\hat{k} < 0.5\), performance may still be reliable for values of \(\hat{k}\) up to 0.7. Values of \(\hat{k} > 0.7\) suggest that the data points are highly influential to the estimated posterior and potentially introducing bias.
The highland LAD fails PSIS diagnostic checks with a value of \(\hat{k}>7\).
Comparison of a model fitted to the full 32 Scottish LADS to a reduced data set with Highlands removed (31 LADS) illustrates the systematic bias introduced on the distance scaling (\(\rho\) parameter). Although posterior distributions are overlapping we remove the highland LAD from the data set for inference to avoid affecting model comparison.
The difference between \(\hat{elpd}\) for alternative models fitted to the same data provides a measure of their relative predictive accuracy.
## model elpd model elpd model elpd model
## 1 ERad 0 (0) CDO 0 (0) CDO 0 (0) CDO
## 2 CDO -11500 (1100) CDP -23.2 (7.8) CDP -16.4 (10) CDE
## 3 CDP -12300 (1100) CDE -82.5 (17) IO -239 (51) CDP
## 4 IO -35400 (640) IO -278 (32) ERad -256 (45) IO
## 5 CDE -41400 (850) ERad -302 (26) CDE -351 (37) ERad
## 6 Imp -60900 (880) Imp -392 (31) Imp -465 (44) Imp
## 7 Stoufer -79300 (960) Stoufer -486 (33) Stoufer -644 (44) Stoufer
## elpd
## 1 0 (0)
## 2 -4.18 (3.2)
## 3 -32.6 (9.4)
## 4 -50.9 (8.2)
## 5 -65 (7.2)
## 6 -78.9 (5.1)
## 7 -98.8 (6.3)
## model elpd model elpd model elpd model
## 1 CDO 0 (0) CDO 0 (0) CDO 0 (0) IO
## 2 CDE -287 (41) CDE -3.64 (2.7) CDE -12.5 (8.6) CDP
## 3 ERad -319 (76) CDP -9.56 (4.4) CDP -14.3 (10) CDO
## 4 CDP -2270 (110) IO -46.6 (9.2) IO -34.1 (16) ERad
## 5 IO -3230 (110) ERad -51.2 (8.2) ERad -45.6 (19) CDE
## 6 Imp -6330 (160) Imp -80.1 (11) Imp -123 (23) Imp
## 7 Stoufer -10400 (220) Stoufer -151 (14) Stoufer -220 (27) Stoufer
## elpd
## 1 0 (0)
## 2 -0.562 (2.7)
## 3 -1.9 (2.9)
## 4 -4.9 (2.1)
## 5 -4.96 (3.8)
## 6 -27 (5)
## 7 -42.2 (6.3)
## model elpd model elpd model elpd model
## 1 CDO 0 (0) CDO 0 (0) CDO 0 (0) CDO
## 2 CDE -299 (43) CDE -21.1 (7.1) CDE -19.9 (6.4) CDE
## 3 ERad -438 (81) ERad -67 (18) ERad -33.7 (20) ERad
## 4 CDP -2950 (150) CDP -154 (28) CDP -298 (38) CDP
## 5 IO -4130 (120) IO -395 (48) IO -530 (44) IO
## 6 Imp -7580 (170) Imp -795 (42) Imp -803 (44) Imp
## 7 Stoufer -12500 (250) Stoufer -1340 (66) Stoufer -1660 (80) Stoufer
## elpd model elpd
## 1 0 (0) CDO 0 (0)
## 2 -297 (40) CDE -27.9 (9.6)
## 3 -449 (73) ERad -98.7 (26)
## 4 -2410 (140) CDP -525 (52)
## 5 -3520 (120) IO -872 (52)
## 6 -6600 (160) Imp -1230 (57)
## 7 -11300 (240) Stoufer -2560 (91)
## model elpd model elpd model elpd model
## 1 CDO 0 (0) CDO 0 (0) CDO 0 (0) CDO
## 2 CDE -21.1 (7.1) CDE -95.8 (18) CDE -247 (35) CDE
## 3 ERad -67 (18) ERad -182 (38) ERad -336 (67) ERad
## 4 CDP -154 (28) CDP -843 (89) CDP -2100 (130) CDP
## 5 IO -395 (48) IO -1350 (82) IO -3130 (110) IO
## 6 Imp -795 (42) Imp -2450 (94) Imp -5710 (150) Imp
## 7 Stoufer -1340 (66) Stoufer -4850 (170) Stoufer -10100 (220) Stoufer
## elpd
## 1 0 (0)
## 2 -30.3 (9.4)
## 3 -115 (21)
## 4 -397 (41)
## 5 -749 (48)
## 6 -1080 (52)
## 7 -2330 (88)
The CDO model (Competing destinations with offset) is favoured (or has indistinguishable predictive performance) for each data set.